Abstract. We present a general theory and corresponding declarative model for
the embodied grounding and natural language based analytical summarisation
of dynamic visuo-spatial imagery. The declarative model, encompassing
spatio-linguistic abstractions, image schemas, and a spatio-temporal feature based
language generator, is modularly implemented within Constraint Logic Programming
(CLP). The implemented model is such that primitives of the theory, e.g.,
pertaining to space and motion, image schemata, are available as first-class
objects with deep semantics suited for inference and query. We demonstrate the
model with select examples broadly motivated by areas such as film, design,
geography, smart environments where analytical natural language based
externalisations of the moving image are central from the viewpoint of human interaction,
evidence-based qualitative analysis, and sensemaking.

Keywords: moving image, visual semantics and embodiment, visuo-spatial cognition
and computation, cognitive vision, computational models of narrative, declarative
spatial reasoning


1 Introduction 


Spatial thinking, conceptualisation, and the verbal and visual (e.g., gestural, iconic,
diagrammatic) communication of commonsense as well as expert knowledge about the
world (the space that we exist in) is one of the most important aspects of everyday
human life [Tversky 2005, 2004; Bhatt 2013]. Philosophers, cognitive scientists,
linguists, psycholinguists, ontologists, information theorists, computer scientists, and
mathematicians have each investigated space through the lenses afforded
by their respective fields of study [Freksa 2004; Mix et al. 2009; Bateman 2010; Bhatt
2012; Bhatt et al. 2013a; Waller and Nadel 2013]. Interdisciplinary studies on
visuo-spatial cognition, e.g., concerning ‘visual perception’, ‘language and space’, ‘spatial
memory’, ‘spatial conceptualisation’, ‘spatial representations’, ‘spatial reasoning’ are
extensive. In recent years, the fields of spatial cognition and computation, and spatial
information theory have established their foundational significance for the design and
implementation of computational cognitive systems, and multimodal interaction &
assistive technologies, e.g., especially in those areas where processing and interpretation
of potentially large volumes of highly dynamic spatio-temporal data is involved [Bhatt
2013]: cognitive vision & robotics, geospatial dynamics [Bhatt and Wallgrün 2014],
architecture design [Bhatt et al. 2014] to name a few prime examples.

Our research addresses ‘space and spatio-temporal dynamics’ from the viewpoints of
visuo-spatial cognition and computation, computational cognitive linguistics, and
formal representation and computational reasoning about space, action, and change. We
especially focus on space and motion as interpreted within artificial intelligence and
knowledge representation and reasoning (KR) in general, and declarative spatial
reasoning [Bhatt et al. 2011; Schultz and Bhatt 2012; Walega et al. 2015] in particular.
Furthermore, the concept of image schemas as “abstract recurring patterns of thought
and perceptual experience” [Johnson 1990; Lakoff 1990] serves a central role in our
formal framework.


Visuo-Spatial Dynamics of the Moving Image. The Moving Image, from the viewpoint
of this paper, is interpreted in a broad sense to encompass:

multi-modal visuo-auditory perceptual signals (also including depth sensing, haptics,
and empirical observational data) where basic concepts of semantic or content level
coherence, and spatio-temporal continuity and narrativity are applicable.


As examples, consider the following: 

► cognitive studies of film aimed at investigating attention and recipient effects in
observers vis-a-vis the motion picture [Nannicelli and Taberham 2014; Aldama 2015]

► evidence-based design [Hamilton and Watkins 2009; Cama 2009] involving analysis
of post-occupancy user behaviour in buildings, e.g., pertaining to visual perception of
signage

► geospatial dynamics aimed at human-centered interpretation of (potentially
large-scale) geospatial satellite and remote sensing imagery [Bhatt and Wallgrün 2014]

► cognitive vision and control in robotics, smart environments etc., e.g., involving
human activity interpretation and real-time object / interaction tracking in professional
and everyday living (e.g., meetings, surveillance and security at an airport) [Vernon
2006, 2008; Dubba et al. 2011; Bhatt et al. 2013b; Spranger et al. 2014; Dubba et al.
2015]

Within all these areas, high-level semantic interpretation and qualitative analysis of the
moving image require the representational and inferential mediation of (declarative)
embodied, qualitative abstractions of the visuo-spatial dynamics, encompassing space,
time, motion, and interaction.


Declarative Model of Perceptual Narratives. With respect to a broad-based
understanding of the moving image (as discussed above), we define visuo-spatial perceptual
narratives as:

declarative models of visual, auditory, haptic and other (e.g., qualitative, analytical)
observations in the real world that are obtained via artificial sensors and / or human
input.

Declarativeness denotes the existence of grounded (e.g., symbolic, sub-symbolic)
models coupled with deep semantics (e.g., for spatial and temporal knowledge) and
systematic formalisation that can be used to perform reasoning and query answering,
embodied simulation, and relational learning. With respect to methods, this paper
particularly alludes to declarative KR frameworks such as logic programming, constraint
logic programming, description logic based spatio-terminological reasoning,
answer-set programming based non-monotonic (spatial) reasoning, or even other specialised
commonsense reasoners based on expressive action description languages for handling
space, action, and change. Declarative representations serve as a basis to externalise
explicit and inferred knowledge, e.g., by way of modalities such as visual and
diagrammatic representations, natural language, etc.

Core Contributions. We present a declarative model for the embodied grounding of
the visuo-spatial dynamics of the moving image, and the ability to generate corresponding
textual summaries that serve an analytical function from a computer-human
interaction viewpoint in a range of cognitive assistive technologies and interaction systems
where reasoning about space, actions, change, and interaction is crucial. The overall
framework encompasses:

(F1). a formal theory of qualitative characterisations of space and motion with deep
semantics for spatial, temporal, and motion predicates

(F2). a formalisation of the embodied image schematic structure of visuo-spatial
dynamics with respect to the formal theory of space and motion

(F3). a declarative spatio-temporal feature-based natural language generation engine
that can be used in a domain-independent manner

The overall framework (F1-F3) for the embodied grounding of the visuo-spatial dynamics
of the moving image, and the externalisation of the declarative perceptual narrative
model by way of natural language has been fully modelled and implemented in an
elaboration tolerant manner within Constraint Logic Programming (CLP). We emphasize
that the level of declarativeness within logic programming is such that each aspect
pertaining to the overall framework can be seamlessly customised and elaborated, and that
question-answering & query can be performed with spatio-temporal relations, image
schemas, path & motion predicates, syntax trees etc. as first class objects within the
CLP environment.

^ Relational learning: broadly, we refer to methods for abstraction,
analogy-hypothesis-theory formation, belief revision, argumentation.

Jakob Suchan, Mehul Bhatt, and Harshita Jhavar

Fig. 1: Analysis based on the Quadrant system (Drive 2011)

Organization of the Paper. Section 2 presents the application scenarios that we will
directly demonstrate as case-studies in this paper; we focus on a class of cognitive
interaction systems where the study of visuo-spatial dynamics in the context of the moving
image is central. Sections 3-4 present the theory of space, motion, and image schemas
elaborating on its formalisation and declarative implementation within constraint logic
programming. Section 5 presents a summary of the declarative natural language
generation component. Section 6 concludes with a discussion of related work.


2 Talking about the Moving Image 


Talking about the moving image denotes:

the ability to computationally generate semantically well-founded, embodied,
multimodal (e.g., natural language, iconic, diagrammatic) externalisations of dynamic
visuo-spatial phenomena as perceived via visuo-spatial, auditory, or sensorimotor
haptic interactions.

In the backdrop of the twin notions of the moving image & perceptual narratives
(Section 1), we focus on a range of computer-human interaction systems & assistive
technologies at the interface of language, logic, and cognition; in particular, visuo-spatial
cognition and computation are most central. Consider the case-studies in (S1-S4).

^ The paper is confined to visual processing and analysis, and ‘talking about it’ by way of natural 
language externalisations. We emphasise that our underlying model is general, and elaboration 
tolerant to other kinds of input features. 
(S1). Cognitive Studies of Film. Cognitive studies of the moving image (specifically,
cognitive film theory) have accorded a special emphasis to the role of mental activity
of observers (e.g., subjects, analysts, general viewers / spectators) as one of the most
central objects of inquiry [Nannicelli and Taberham 2014; Aldama 2015] (e.g., expert
analysis in Listing L1; Fig. 1). Amongst other things, cognitive film studies concern
making sense of subjects' visual fixation or saccadic eye-movement patterns whilst
watching a film and correlating this with deep semantic analysis of the visuo-auditory data
(e.g., fixation on movie characters, influence of cinematographic devices such as cuts
and sound effects on attention), and studies in embodiment [Sobchack 2004; Coegnarts and
Kravanja 2012].


DRIVE (2011) | QUADRANT SYSTEM. VISUAL ATTENTION.
Director. Nicolas Winding Refn

This short scene, involving The Driver (Ryan Gosling) and Irene (Carey Mulligan), adopts a
top-bottom and left-right quadrant system that is executed in a single take / without any cuts.

The camera moves backward tracking the movement of The Driver and Irene; during movement_1,
Irene occupies the right quadrant, while The Driver occupies the left quadrant.

Spectator eye-tracking data suggests that the audience is repeatedly switching their attention
between the left and right quadrants, with a majority of the audience fixating visual attention
on Irene as she moves into an extreme close-up shot.

Credit. Quadrant system method based on study by Tony Zhou.


(S2). Evidence Based Design (EBD) of the Built Environment. Evidence-based building
design involves the study of the post-occupancy behaviour of building users with the aim
to provide a scientific basis for generating best practice guidelines aimed at improving
building performance and user experience. Amongst other things, this involves an analysis
of the visuo-locomotive navigational experience of subjects based on eye-tracking and
egocentric video capture based analysis of visual perception and attention, indoor
people-movement analysis, e.g., during a wayfinding task, within a large-scale built-up
environment such as a hospital or an airport (e.g., see Listing L2). EBD is typically
pursued as an interdisciplinary endeavour (involving environmental psychologists,
architects, technologists) toward the development of new tools and processes for data
collection, qualitative analysis etc.


THE NEW PARKLAND HOSPITAL | WAYFINDING STUDY.
Location. Dallas, Texas

This experiment was conducted with 50 subjects at the New Parkland Hospital in Dallas.

Subject 21 (Barbara) performed a wayfinding task (#T5), starting from the reception desk of the
emergency department and finishing at the Anderson Pharmacy. Wayfinding task #5 goes through the
long corridor in the emergency department, the main reception and the blue elevators, going up to
Level 2 into the Atrium Lobby, passing through the Anderson-Bridge, finally arriving at the
X-pharmacy.

Eye-tracking data and video data analysis suggests that Barbara fixated on passerby Person_5 for
two seconds as Person_5 passes from her right in the long corridor. Barbara fixated most on the
big blue elevator signage at the main reception desk. During the 12th minute, video data from
external GoPro cameras and egocentric video capture and eye-tracking suggest that Barbara looked
indecisive (stopped walking, looked around, performed rapid eye-movements).

Credit. Based on joint work with Corgan Associates (Dallas)
(S3). Geospatial Dynamics. The ability of semantic and qualitative analysis to
complement and synergize with statistical and quantitatively-driven methods
has been recognized as important within geographic information systems. Research in
geospatial dynamics [Bhatt and Wallgrün 2014] investigates the theoretical foundations
necessary to develop the computational capability for high-level commonsense,
qualitative analysis of dynamic geospatial phenomena within next generation event and
object-based GIS systems.


(S4). Human Activity Interpretation. Research on embodied perception of vision,
termed cognitive vision [Vernon 2006, 2008; Bhatt et al. 2013b], aims to
enhance classical computer vision systems with cognitive abilities to obtain more robust
vision systems that are able to adapt to unforeseen changes, make “narrative” sense
of perceived data, and exhibit interpretation-guided goal directed behaviour. The
long-term goal in cognitive vision is to provide general tools (integrating different aspects
of space, action, and change) necessary for tasks such as real-time human activity
interpretation and dynamic sensor (e.g., camera) control within the purview of vision,
interaction, and robotics.


3 Space, Time, and Motion 


Qualitative Spatial & Temporal Representation and Reasoning (QSTR) [Cohn and
Hazarika 2001] abstracts from an exact numerical representation by describing the
relations between objects using a finite number of symbols. Qualitative representations
use a set of relations that hold between objects to describe a scene. Galton [Galton
1993, 1995, 2000] investigated movement on the basis of an integrated theory of space,
time, objects, and position. Muller [Muller 1998] defined continuous change using
4-dimensional regions in space-time. Hazarika and Cohn [Hazarika and Cohn 2002]
built on this work but used an interval based approach to represent spatio-temporal
primitives.

We use spatio-temporal relations to represent and reason about different aspects of
space, time, and motion in the context of visuo-spatial perception as described by
[Suchan et al. 2014]. To describe the spatial configuration of a perceived scene and
the dynamic changes within it, we combine spatial calculi into a general theory for
declaratively reasoning about spatio-temporal change. The domain independent theory of
Space, Time, and Motion ($\Sigma_{\text{STM}}$) consists of:

► $\Sigma_{\text{Space}}$ - Spatial Relations on topology, relative position, relative distance of
spatial objects

► $\Sigma_{\text{Time}}$ - Temporal Relations for representing relations between time points and
intervals

► $\Sigma_{\text{Motion}}$ - Motion Relations on changes of distance and size of spatial objects

The resulting theory is given as:
$\Sigma_{\text{STM}} =_{def} [\Sigma_{\text{Space}} \cup \Sigma_{\text{Time}} \cup \Sigma_{\text{Motion}}]$.
Fig. 2: Region Connection Calculus (RCC-8)

Fig. 3: General Theory of Space, Time, Motion, and Image Schema


Objects and individuals are represented as spatial primitives according to the nature of
the spatial domain we are looking at, i.e., regions of space $\mathcal{S} = \{s_1, s_2, ..., s_n\}$,
points $\mathcal{P} = \{p_1, p_2, ..., p_n\}$, and line segments $\mathcal{L} = \{l_1, l_2, ..., l_n\}$.
Towards this we use functions that map from the object or individual to the corresponding
spatial primitive. The spatial configuration is represented using n-ary spatial relations
$\mathcal{R} = \{r_1, r_2, ..., r_n\}$ of an arbitrary spatial calculus.
$\Phi = \{\phi_1, \phi_2, ..., \phi_n\}$ is a set of propositional and functional fluents,
e.g. $\phi(e_1, e_2)$ denotes the spatial relationship between $e_1$ and $e_2$. Temporal
aspects are represented using time points $\mathcal{T} = \{t_1, t_2, ..., t_n\}$ and time
intervals $\mathcal{I} = \{i_1, i_2, ..., i_n\}$. $Holds(\phi, r, at(t))$ is used to denote that
the fluent $\phi$ has the value $r$ at time $t$. To denote that a relation holds for more than
one contiguous time point, we define time intervals by their start and end points, using
$between(t_1, t_2)$. $Occurs(\theta, at(t))$ and $Occurs(\theta, between(t_1, t_2))$ are used
to denote that an event or action $\theta$ occurred.
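The Holds and Occurs predicates can be illustrated with a small fact store. The following is a minimal sketch in Python (not the paper's CLP implementation), assuming discrete integer time points; the class and method names (`At`, `Between`, `Narrative`, `assert_holds`, `assert_occurs`) are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class At:
    """A single time point t (assumed to be an integer frame index)."""
    t: int


@dataclass(frozen=True)
class Between:
    """A closed interval of time points [t1, t2]."""
    t1: int
    t2: int

    def __post_init__(self):
        if self.t1 > self.t2:
            raise ValueError("interval start must not exceed its end")


def time_points(when):
    """Enumerate the discrete time points covered by At or Between."""
    if isinstance(when, At):
        return range(when.t, when.t + 1)
    return range(when.t1, when.t2 + 1)


class Narrative:
    """Hypothetical fact store: fluent values per time point, plus events."""

    def __init__(self):
        self.fluents = {}   # (fluent, time point) -> value
        self.events = []    # list of (event, At | Between)

    def assert_holds(self, fluent, value, when):
        for t in time_points(when):
            self.fluents[(fluent, t)] = value

    def holds(self, fluent, value, when):
        """True if the fluent has the given value at every covered time point."""
        return all(self.fluents.get((fluent, t)) == value
                   for t in time_points(when))

    def assert_occurs(self, event, when):
        self.events.append((event, when))

    def occurs(self, event, when):
        return (event, when) in self.events


# Usage: a topological fluent that holds over frames 1 to 5.
narrative = Narrative()
fluent = ("topology", "driver", "left_quadrant")
narrative.assert_holds(fluent, "ntpp", Between(1, 5))
print(narrative.holds(fluent, "ntpp", At(3)))
```

Representing `at` and `between` as distinct terms mirrors the notation above; a query over an interval succeeds only if the value holds at every time point in it.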


3.1 $\Sigma_{\text{Space}}$ - Spatial Relations

The theory consists of spatial relations on objects, which includes relations on topology
and extrinsic orientation in terms of left, right, above, below relations and depth
relations (distance of spatial entity from the spectator).

► Topology. The Region Connection Calculus (RCC) [Cohn et al. 1997] is an
approach to represent topological relations between regions in space. We use the RCC-8
subset of the RCC, which consists of the eight base relations in $\mathcal{R}_{\text{top}}$
(Figure 2), for representing regions of perceived objects, e.g. the projection of an object
on the image plane.

$\mathcal{R}_{\text{top}}$ = {dc, ec, po, eq, tpp, ntpp, tpp$^{-1}$, ntpp$^{-1}$}
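As an illustration only, the eight base relations can be computed for axis-aligned bounding boxes, such as the image-plane projection of a tracked object. The sketch below is in Python rather than the paper's CLP; the `Rect` type, the function name `rcc8`, and the labels `tppi` / `ntppi` for the inverse relations are assumptions made for this example, and regions are treated as closed rectangles.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Closed axis-aligned rectangle with x1 < x2 and y1 < y2."""
    x1: float
    y1: float
    x2: float
    y2: float


def _contains(outer: Rect, inner: Rect) -> bool:
    """inner lies within outer (boundaries may touch)."""
    return (outer.x1 <= inner.x1 and inner.x2 <= outer.x2 and
            outer.y1 <= inner.y1 and inner.y2 <= outer.y2)


def _strictly_contains(outer: Rect, inner: Rect) -> bool:
    """inner lies within the interior of outer (no boundary contact)."""
    return (outer.x1 < inner.x1 and inner.x2 < outer.x2 and
            outer.y1 < inner.y1 and inner.y2 < outer.y2)


def rcc8(a: Rect, b: Rect) -> str:
    """Return the RCC-8 base relation that holds between a and b."""
    if a == b:
        return "eq"
    # No shared point at all: disconnected.
    if a.x2 < b.x1 or b.x2 < a.x1 or a.y2 < b.y1 or b.y2 < a.y1:
        return "dc"
    # Shared boundary points but disjoint interiors: externally connected.
    interiors_overlap = (a.x1 < b.x2 and b.x1 < a.x2 and
                         a.y1 < b.y2 and b.y1 < a.y2)
    if not interiors_overlap:
        return "ec"
    if _contains(b, a):
        return "ntpp" if _strictly_contains(b, a) else "tpp"
    if _contains(a, b):
        return "ntppi" if _strictly_contains(a, b) else "tppi"
    return "po"


# Usage: a face region strictly inside a quadrant region.
print(rcc8(Rect(1, 1, 2, 2), Rect(0, 0, 4, 4)))
```

Restricting regions to rectangles keeps every case a handful of coordinate comparisons; arbitrary region shapes would need a geometry library.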


► Relative Position. We represent the position of two spatial entities, with respect
to the observer's viewpoint, using a 3-dimensional representation that resembles Allen's
interval algebra [Allen 1983] for each dimension, i.e. vertical, horizontal, and depth
(distance from the observer).
$\mathcal{R}_{\text{pos}} = [\mathcal{R}_{\text{pos-v}} \cup \mathcal{R}_{\text{pos-h}} \cup \mathcal{R}_{\text{pos-d}}]$

$\mathcal{R}_{\text{pos-v}}$ = {above, overlaps_above, along_above, vertically_equal, overlaps_below,
along_below, below}

$\mathcal{R}_{\text{pos-h}}$ = {left, overlaps_left, along_left, horizontally_equal, overlaps_right,
along_right, right}

$\mathcal{R}_{\text{pos-d}}$ = {closer, overlaps_closer, along_closer, distance_equal, overlaps_further,
along_further, further}


► Relative Distance. We represent the relative distance between two points $p_1$ and
$p_2$ with respect to a third point $p_3$, using ternary relations $\mathcal{R}_{\text{dist}}$.

$\mathcal{R}_{\text{dist}}$ = {closer, further, same}

► Relative Size. For comparison of the size of two regions we use the relations in
$\mathcal{R}_{\text{size}}$.

$\mathcal{R}_{\text{size}}$ = {smaller, bigger, same}
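A minimal Python sketch of the two comparisons follows. It assumes Euclidean distance, region size given as an area, and a small numeric tolerance for the `same` case; it also assumes that `closer` means the first point is nearer to the reference point than the second. The function names are hypothetical.

```python
import math


def relative_distance(p1, p2, p3, tol=1e-9):
    """Compare the distances of p1 and p2 to the reference point p3."""
    d1 = math.dist(p1, p3)
    d2 = math.dist(p2, p3)
    if abs(d1 - d2) <= tol:
        return "same"
    return "closer" if d1 < d2 else "further"


def relative_size(area_a, area_b, tol=1e-9):
    """Compare the sizes (areas) of two regions a and b."""
    if abs(area_a - area_b) <= tol:
        return "same"
    return "smaller" if area_a < area_b else "bigger"


# Usage: p1 is nearer to the reference point than p2.
print(relative_distance((1, 0), (3, 0), (0, 0)))
```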


3.2 $\Sigma_{\text{Time}}$ - Temporal Relations

Temporal relations are used to represent the relationship between actions and events,
e.g. one action happened before another action. We use the extensions of Allen's interval
relations [Allen 1983] as described by [Vilain 1982], i.e. these consist of relations
between time points, between intervals, and between points and intervals.

$\mathcal{R}_{\text{point}}$ = {•before•, •after•, •equals•}

$\mathcal{R}_{\text{interval}}$ = {before, after, during, contains, starts, started_by, finishes,
finished_by, overlaps, overlapped_by, meets, met_by, equal}

$\mathcal{R}_{\text{point-interval}}$ = {•before, after•, •starts, started_by•, •during, contains•,
•finishes, finished_by•, •after, before•}

The relations used for temporal representation of actions and events are the union of
these three, i.e.
$\mathcal{R}_{\text{Time}} = [\mathcal{R}_{\text{point}} \cup \mathcal{R}_{\text{interval}} \cup \mathcal{R}_{\text{point-interval}}]$.
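To make the interval and point-interval relations concrete, the following Python sketch classifies a pair of intervals into one of the thirteen interval relations and a point against an interval into one of five point-to-interval relations. It assumes proper intervals given as `(start, end)` tuples with start < end; the function names are hypothetical.

```python
def allen(a, b):
    """Return the Allen relation of interval a to interval b.

    Both intervals are (start, end) tuples with start < end.
    """
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met_by"
    if a1 == b1 and a2 == b2:
        return "equal"
    if a1 == b1:
        return "starts" if a2 < b2 else "started_by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished_by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Remaining cases: the intervals properly overlap.
    return "overlaps" if a1 < b1 else "overlapped_by"


def point_interval(p, interval):
    """Return the relation of time point p to an interval (start, end)."""
    start, end = interval
    if p < start:
        return "before"
    if p == start:
        return "starts"
    if p < end:
        return "during"
    if p == end:
        return "finishes"
    return "after"


# Usage: one action ends exactly when the next one begins.
print(allen((1, 3), (3, 5)))
```

The inverse point-interval relations (interval to point) follow by swapping the arguments and reading the result in the opposite direction.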


3.3 $\Sigma_{\text{Motion}}$ - Qualitative Spatial Dynamics

Spatial relations holding for perceived spatial objects change as a result of motion of
the individuals in the scene. To account for this, we define motion relations by making
qualitative distinctions of the changes in the parameters of the objects, i.e. the distance
between two depth profiles and their size.

► Relative Movement. The relative movement of pairs of spatial objects is represented
in terms of changes in the distance between two points representing the objects.

$\mathcal{R}_{\text{move}}$ = {approaching, receding, static}

► Size Motion. For representing changes in size of objects, we consider relations on
each dimension (horizontal, vertical, and depth) separately. Changes on more than one
of these parameters at the same time instant can be represented by combinations of the
relations.

$\mathcal{R}_{\text{size}}$ = {elongating, shortening, static}
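Both motion relations reduce to the sign of a change between two observations. A minimal Python sketch, assuming point positions and per-dimension extents sampled at two time points and a hypothetical noise threshold `eps`:

```python
import math


def qualitative_change(before, after, labels, eps=1e-6):
    """Map a numeric change to one of (decreasing, increasing, unchanged)."""
    decreasing, increasing, unchanged = labels
    delta = after - before
    if abs(delta) <= eps:
        return unchanged
    return increasing if delta > 0 else decreasing


def relative_movement(p_t1, q_t1, p_t2, q_t2, eps=1e-6):
    """Classify how the distance between objects p and q changes from t1 to t2."""
    d1 = math.dist(p_t1, q_t1)
    d2 = math.dist(p_t2, q_t2)
    return qualitative_change(d1, d2, ("approaching", "receding", "static"), eps)


def size_motion(extent_t1, extent_t2, eps=1e-6):
    """Classify the change of an object's extent along one dimension."""
    return qualitative_change(extent_t1, extent_t2,
                              ("shortening", "elongating", "static"), eps)


# Usage: q moves towards p, and an object's horizontal extent grows.
print(relative_movement((0, 0), (10, 0), (0, 0), (4, 0)))
print(size_motion(2.0, 3.5))
```

Calling `size_motion` once per dimension gives the combination of relations mentioned above, e.g. elongating horizontally while static vertically.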


4 Image Schemas of the Moving Image 


Table 1: Image Schemas identifiable in the literature (non-exhaustive list)

SPACE            ABOVE, ACROSS, COVERING, CONTACT, VERTICAL_ORIENTATION, LENGTH

MOTION           CONTAINMENT, PATH, PATH_GOAL, SOURCE_PATH_GOAL, BLOCKAGE,
                 CENTER_PERIPHERY, CYCLE, CYCLIC_CLIMAX

FORCE            COMPULSION, COUNTERFORCE, DIVERSION,
                 REMOVAL_OF_RESTRAINT / ENABLEMENT, ATTRACTION, LINK, SCALE

BALANCE          AXIS_BALANCE, POINT_BALANCE, TWIN_PAN_BALANCE, EQUILIBRIUM

TRANSFORMATION   LINEAR_PATH_FROM_MOVING_OBJECT, PATH_TO_ENDPOINT,
                 PATH_TO_OBJECT_MASS, MULTIPLEX_TO_MASS, REFLEXIVE, ROTATION

OTHERS           SURFACE, FULL_EMPTY, MERGING, MATCHING, NEAR_FAR, MASS_COUNT,
                 ITERATION, OBJECT, SPLITTING, PART_WHOLE, SUPERIMPOSITION,
                 PROCESS, COLLECTION


Image schemas have been a cornerstone in cognitive linguistics [Geeraerts and Cuyckens
2007], and have also been investigated from the perspective of psycholinguistics,
and language and cognitive development [Mandler 1992; Mandler and Pagán Cánovas
2014]. Image schemas, as embodied structures founded on experiences of interactions
with the world, serve as the ideal framework for understanding and reasoning about
perceived visuo-spatial dynamics, e.g., via generic conceptualisation of space, motion,
force, balance, transformation, etc. Table 1 presents a non-exhaustive list of image
schemas identifiable in the literature. We formalise image schemas on individuals,
objects and actions of the domain, and ground them in the spatio-temporal dynamics, as
defined in Section 3, that underlie the particular schema. As examples, we focus
on the spatial entities PATH, CONTAINER, THING, the spatial relation CONTACT,
and movement relations MOVE, INTO, OUT OF (these being regarded as highly
important and foundational from the viewpoint of cognitive development [Mandler and
Pagán Cánovas 2014]).
CONTAINMENT. The CONTAINMENT schema denotes that an object or an individual
is inside a container object.

    % An entity is contained in a container if it is inside it.
    containment(entity(E), container(C)) :- inside(E, C).

As an example consider the following description from the film domain described in
Listing L1.

Irene occupies the right quadrant, while The Driver occupies the left quadrant.

In the movie example the ENTITY is a person in the film, namely The Driver, and the
CONTAINER is a cinematographic object, the left quadrant, which is used to analyse
the composition of the scene. We define the inside relation based on the involved
individuals and objects, e.g. in this case we define the topological relationship between
The Driver's face and the left quadrant.

    % A person is inside a quadrant if the person's region is a
    % non-tangential proper part (ntpp) of the quadrant's region.
    inside(person(P), cinemat_object(quadrant(Q))) :-
        region(person(P), P_region),
        region(cinemat_object(quadrant(Q)), Q_region),
        topology(ntpp, P_region, Q_region).

To decide on the words to use for describing the schema, we make distinctions on
the involved entities and the spatial characteristics of the scene, e.g. we use the word
‘occupies’ when the person is taking up the whole space of the container, i.e. the size
is bigger than a certain threshold.

    % Choose 'occupy' when the entity's region exceeds the container's size threshold.
    phrase(containment(E, C), [E, 'occupy', C]) :-
        region(person(E), E_region),
        region(cinemat_object(quadrant(C)), C_region),
        threshold(C_region, C_thresh),
        size(bigger, E_region, C_thresh).

Similarly, we choose the word ‘in’ when the person is fully contained in the quadrant.
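The word choice can be illustrated numerically. The Python sketch below is a simplification under stated assumptions: regions are axis-aligned boxes, containment is tested without distinguishing tpp from ntpp, and the `occupy_ratio` threshold of 0.5 is a hypothetical value, not one taken from the paper.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Axis-aligned image region: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(box: Box) -> float:
    return (box.x2 - box.x1) * (box.y2 - box.y1)


def inside(entity: Box, container: Box) -> bool:
    """True if the entity region lies within the container region."""
    return (container.x1 <= entity.x1 and entity.x2 <= container.x2 and
            container.y1 <= entity.y1 and entity.y2 <= container.y2)


def containment_phrase(entity_name: str, entity: Box,
                       container_name: str, container: Box,
                       occupy_ratio: float = 0.5) -> Optional[str]:
    """Pick a verb for the CONTAINMENT schema, or None if it does not apply."""
    if not inside(entity, container):
        return None
    fills_container = area(entity) / area(container) >= occupy_ratio
    verb = "occupies" if fills_container else "is in"
    return f"{entity_name} {verb} {container_name}"


# Usage: the left half of a 1920x1080 frame as the container.
left_quadrant = Box(0, 0, 960, 1080)
print(containment_phrase("The Driver", Box(100, 100, 900, 1000),
                         "the left quadrant", left_quadrant))
```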


PATH_GOAL and SOURCE_PATH_GOAL. The PATH_GOAL image schema is used to
conceptualise the movement of an object or an individual, towards a goal location, on
a particular path. In this case, the path is the directed movement towards the goal. The
SOURCE_PATH_GOAL schema builds on the PATH_GOAL schema by adding a source
to it. Both schemas are used to describe movement; however, in the first case, the source
is not important, only the goal of the movement is of interest. Here we only describe
the SOURCE_PATH_GOAL schema in more detail, as the PATH_GOAL schema is the same,
without the source in it.

    % A trajector moves along a path from a source location to a goal location.
    source_path_goal(Trajector, Source, Path, Goal) :-
        entity(Trajector), location(Source), location(Goal),
        path(Path, Source, Goal),
        at_location(Trajector, Source, at_time(T_1)),
        at_location(Trajector, Goal, at_time(T_2)),
        move(Trajector, Path, between(T_1, T_2)).

In the wayfinding analysis one example of the SOURCE_PATH_GOAL schema is when
a description of the path a subject was walking is generated.

Barbara walks from the emergency, through the atrium lobby to the blue elevators.

Another example is when a description of a subject's eye movement is generated from
the eye tracking experiment.

Barbara's eyes move from the emergency sign, over the exit sign to the elevator sign.

In both of these sentences there is a moving entity, the trajector, a source and a goal
location, and a path connecting the source and the goal. In the first sentence it is Barbara
who is moving, while in the second sentence Barbara's eyes are moving. Based on the
different spatial entities involved in the movement, we need different definitions of
locations, path, and the moving actions. In the wayfinding domain, a subject is at a
location when the position of the person upon a 2-dimensional floorplan is inside the
region denoting the location, e.g. a room, a corridor, or any spatial artefact describing a
region in the floorplan.

    % A subject is at a location if its floorplan position is a
    % non-tangential proper part (ntpp) of the location's region.
    at_location(Subject, Location) :-
        person(Subject), room(Location),
        position(Subject, S_pos), region(Location, L_reg),
        topology(ntpp, S_pos, L_reg).

Possible paths between the locations of a floorplan are represented by a topological
route graph, on which the subject is walking.

    % A person moves along a path by walking on it while approaching a goal.
    move(person(Subject), Path) :-
        action(movement(walk), Subject, Path),
        movement(approaching, Subject, Goal).

For generating language, we have to take the type of the trajector into account, as well as
the involved movement and the locations, e.g. the eyes are moving ‘over’ some objects,
but Barbara moves ‘through’ the corridor.
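The interplay of route graph, path, and trajector-specific wording can be sketched as follows. This is a Python illustration, not the paper's CLP code: the graph contents are hypothetical (loosely following Listing L2), the path is found by breadth-first search, and the sentence template is an assumption standing in for the language generator.

```python
from collections import deque

# Hypothetical topological route graph: location -> adjacent locations.
ROUTE_GRAPH = {
    "emergency department": ["long corridor"],
    "long corridor": ["emergency department", "main reception"],
    "main reception": ["long corridor", "blue elevators"],
    "blue elevators": ["main reception", "atrium lobby"],
    "atrium lobby": ["blue elevators"],
}


def find_path(graph, source, goal):
    """Breadth-first search; returns the list of locations or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


def describe(trajector, verb, preposition, path):
    """Realise source, intermediate path, and goal as one sentence."""
    if len(path) < 2:
        raise ValueError("a path needs at least a source and a goal")
    source, *via, goal = path
    middle = ""
    if via:
        middle = f", {preposition} the " + " and the ".join(via)
    return f"{trajector} {verb} from the {source}{middle} to the {goal}."


# Usage: the verb and preposition depend on the type of trajector.
walk = find_path(ROUTE_GRAPH, "emergency department", "blue elevators")
print(describe("Barbara", "walks", "through", walk))
print(describe("Barbara's eyes", "move", "over",
               ["emergency sign", "exit sign", "elevator sign"]))
```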


ATTRACTION. The ATTRACTION schema expresses a force by which an entity is
attracted.

    % A subject is attracted by an entity if an attraction force holds between them.
    attraction(Subject, Entity) :-
        entity(Subject), entity(Entity),
        force(attraction, Subject, Entity).

An example for ATTRACTION is the eye tracking experiment, when the attention of a
subject is attracted by some object in the environment.

While walking through the hallway, Barbara's attention is attracted by the outside view.

In this case the entity is Barbara's attention, which is represented by the eye tracking
data, and it is attracted by the force that the outside view applies to it. We define attraction
by the fact that the gaze position of Barbara has been on the outside view for a substantial
amount of time; however, this definition can be adapted to the needs of domain experts,
e.g. architects who want to know what grabs the attention of people in
a building.
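One way to read "a substantial amount of time" is as a dwell-time threshold over labelled gaze samples. The Python sketch below assumes gaze data given as time-sorted `(timestamp_seconds, label)` pairs and a hypothetical threshold `min_dwell`; both the data format and the threshold are assumptions that domain experts would adapt.

```python
def longest_dwell(samples, target):
    """Longest continuous span (in seconds) during which gaze stayed on target.

    samples: list of (timestamp_seconds, label) pairs, sorted by time.
    """
    best = 0.0
    start = None
    for timestamp, label in samples:
        if label == target:
            if start is None:
                start = timestamp
            best = max(best, timestamp - start)
        else:
            start = None
    return best


def attraction(samples, target, min_dwell=1.0):
    """True if attention was held by the target for at least min_dwell seconds."""
    return longest_dwell(samples, target) >= min_dwell


# Usage: gaze samples recorded while walking through the hallway.
gaze = [
    (0.0, "corridor"),
    (0.5, "outside view"),
    (1.0, "outside view"),
    (2.0, "outside view"),
    (2.5, "signage"),
    (3.0, "outside view"),
]
print(attraction(gaze, "outside view"))
```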


5 From Perceptual Narratives to Natural Language 


The design and implementation of the natural language generation component has been
driven by three key developmental goals: (1) ensuring support for, and uniformity with
respect to, the (deep) representational semantics of space and motion relations etc.
(Section 3); (2) development of a modular, yet tightly integrated set of components that can
be easily used within the state-of-the-art (constraint) logic programming family of KR
methods; and (3) providing seamless integration capabilities within hybrid AI and
computational cognition systems.


System Overview (NL Generation) 


The overall pipeline of the language generation component follows a standard natural
language generation system architecture [Reiter and Dale 2000; Bateman and Zock
2003]. Figure 4 illustrates the system architecture encompassing the typical stages of
content determination & result structuring, linguistic & syntactic realisation, and syntax
tree & sentence generation.


S1. Input - Interaction Description Schema. Interfacing with the language generator
is possible with a generic (activity-theoretic) Interaction Description Schema (IDS)
that is founded on the ontology of the (declarative) perceptual narrative, and a general
set of constructs to introduce the domain-specific vocabulary. Instances of the IDS
constitute the domain-specific input data for the generator.

Fig. 4: From Perceptual Narratives to Natural Language


S2. Syntax Tree and Sentence Generation. The generator consists of sub-modules
concerned with input IDS instance to text planning, morphological & syntactic
realisation, and syntax tree & sentence generation. Currently, the generator functions in a
single interaction mode where each invocation of the system (with an input instance of
the IDS) produces a single sentence in order to produce spatio-temporal domain-based
text. The morphological and syntactic realisation module brings in assertions of detailed
grammatical knowledge and the lexicon that needs to be encapsulated for morphological
realisation; this encompasses aspects such as noun and verb categories, spatial
relations and locations; part of speech identification is also performed at this stage,
including determiner and adjective selection, selection of verb and tense etc. The parts
of speech identified by the morphological analyser, taken together with context free grammar
rules for simple, complex, and compound sentence constructions, are used for syntactic
realisation and sentence generation.
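The path from an input instance to a syntax tree and then to a sentence can be sketched in a few lines. The Python example below is illustrative only and not the paper's generator: the IDS field names (`agent`, `action`, `relation`, `location`), the tiny verb lexicon, and the single sentence pattern are all hypothetical.

```python
# Hypothetical lexicon: verb lemma -> surface form per tense.
VERB_FORMS = {
    "walk": {"present": "walks", "past": "walked", "future": "will walk"},
    "look": {"present": "looks", "past": "looked", "future": "will look"},
}


def build_tree(ids, tense):
    """Build a syntax tree (nested tuples, label first) for a simple sentence."""
    verb = VERB_FORMS[ids["action"]][tense]
    return ("S",
            ("NP", ids["agent"]),
            ("VP",
             ("V", verb),
             ("PP",
              ("P", ids["relation"]),
              ("NP", ("Det", "the"), ("N", ids["location"])))))


def leaves(tree):
    """Collect the words at the leaves of a nested-tuple syntax tree."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:      # tree[0] is the category label
        words.extend(leaves(child))
    return words


def realise(ids, tense="present"):
    """Generate one sentence from one input instance."""
    return " ".join(leaves(build_tree(ids, tense))) + "."


# Usage: one invocation produces one sentence.
instance = {"agent": "Barbara", "action": "walk",
            "relation": "through", "location": "corridor"}
print(realise(instance, "past"))
```

Keeping the tree as an explicit intermediate value reflects the staged architecture described above, where the syntax tree is itself an inspectable result.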


Language Generation (Done Declaratively) 

Each aspect of the generation process, be it at a factual level (grammar, lexicon, input data)
or at a process level (realisation, syntax tree generation), is fully declarative (to the
extent possible in logic programming) and elaboration tolerant (i.e., addition or removal
of facts & rules, constraints etc. does not break down the generation process). An
important consequence of this level of declarativeness is that a query can work both ways:
from input data to syntax tree to sentence, or from a sentence back to its syntax tree and
linguistic decomposition with respect to a specific lexicon.


Empirical Evaluation of Language Generation 

We tested the language generation component with data for 25 subjects, 500 IDS
instances, and 53 domain facts (using an Intel Core i7-3630QM CPU @ 2.40GHz x 8). We
generated summaries in simple/continuous present, past, future respectively for all IDS
instances. Table 2 reports: (a) summary generation times, averaged over 20 interactions,
with on average 26.2 sentences per summary and 17.6 tokens per sentence; (b) the average
sentence generation time over 100 generated sentences for simple, compound, and complex
types.
Table 2: Time (in ms) for (a) summaries, (b) sentences

(a)
Tense        Avg.    Min.   Max.
simple       77.8    70     96
continuous   84.48   73     99

(b)
Type         Time
simple       0.52
compound     1.23
complex      1.32



6 Discussion and Related Work 


Cognitive vision as an area of research has already gained prominence, with several
recent initiatives addressing the topic from the perspectives of language, logic, and
artificial intelligence [Vernon 2006, 2008; Dubba et al. 2011; Bhatt et al. 2013b; Spranger
et al. 2014; Dubba et al. 2015]. There has also been an increased interest from the
computer vision community to synergise with cognitively motivated methods for language
grounding and inference with visual imagery [Karpathy and Fei-Fei 2015; Yu et al.
2015]. This paper has not attempted to present advances in basic computer vision
research; in general, this is not the agenda of our research even outside the scope of this
paper. The low-level visual processing algorithms that we utilise are founded in
state-of-the-art outcomes from the computer vision community for detection and tracking of
people, objects, and motion [Canny 1986; Lucas and Kanade 1981; Viola and Jones
2001; Dalal and Triggs 2005]. On the language front, the number of research projects
addressing natural language generation systems [Reiter and Dale 2000; Bateman and
Zock 2003] is overwhelming; there exist a plethora of projects and initiatives focussing
on language generation in general or specific contexts, candidate examples being the
works in the context of weather report generation [Goldberg et al. 1994; Sripada et al.
2014] and pollen forecasts [Turner et al. 2006]. Our focus on the (declarative) language
generation component of the framework of this paper (Section 5) has been on the use
of “deep semantics” for space and motion, and to have a unified framework (with each
aspect of the embodied perception grounding framework) fully implemented within
constraint logic programming.

Our research is motivated by computational cognitive systems concerned with
interpreting multimodal dynamic perceptual input; in this context, we believe that it is
essential to build systematic methods and tools for embodied visuo-spatial conception,
formalisation, and computation with primitives of space and motion. Toward this, this
paper has developed a declarative framework for embodied grounding and natural
language based analytical summarisation of the moving image; the implemented model
consists of modularly built components for logic-based representation and reasoning
about qualitative and linguistically motivated abstractions about space, motion, and
image schemas. Our model and approach can directly provide the foundations that are
needed for the development of novel assistive technologies in areas where high-level
qualitative analysis and sensemaking [Bhatt et al. 2013a; Bhatt 2013] of dynamic
visuo-spatial imagery is central.

^ For instance, we analyse motion in a scene using sparse and dense optical flow [Lucas and
Kanade 1981; Farnebäck 2003], detecting faces using cascades of features [Viola and Jones 2001],
and detecting humans using histograms of oriented gradients [Dalal and Triggs 2005].

^ We have been unable to locate a fitting & comparable spatio-temporal feature sensitive
language generation module for open-source usage. We will disseminate our language generation
component as an open-source PROLOG library.